…_EMPTY
PPLFuncImpTable was emitting SqlStdOperatorTable.IS_EMPTY against string
operands as part of its isempty(x) / isblank(x) lowering. That operator
is the SQL:2003 multiset emptiness predicate — its OperandTypeChecker is
COLLECTION_OR_MAP and its enumerable-runtime binding calls
java.util.Collection.isEmpty() reflectively. Passing a string slipped
through only because RexBuilder.makeCall bypasses the operand checker
and Calcite's Linq4j codegen emits a bare `target.isEmpty()` that happens
to bind to String.isEmpty() at Janino compile time. Outside of the
script-pushdown path, the abuse breaks: any backend that does not run
RexNodes through Calcite's enumerable codegen — most notably Substrait
emitters such as the analytics-engine route — has no isthmus mapping
for SqlStdOperatorTable.IS_EMPTY and rejects the plan with
"unresolved function reference IS EMPTY".
Replace the lowering with the predicate that actually matches the doc:
isempty(x) -> OR(IS_NULL(x), CHAR_LENGTH(x) = 0)
isblank(x) -> OR(IS_NULL(x), CHAR_LENGTH(TRIM(BOTH ' ' FROM x)) = 0)
CHAR_LENGTH is a standard string operator with explicit type semantics,
backed by SqlFunctions.charLength(String) in Calcite's enumerable
runtime, and it round-trips through Substrait's default extension
catalog. Same observable behavior on existing PPL paths; correct
behavior on Substrait-based paths.
Also remove PredicateAnalyzer.containIsEmptyFunction. It existed solely
to abort DSL pushdown when it saw the OR(IS_NULL, IS_EMPTY) shape
because the SHOULD branches would NPE on null operands evaluating
.isEmpty() through the script. The new shape produces
OR(IS_NULL, CHAR_LENGTH = 0); the IS_NULL arm pushes down to
must_not.exists, the CHAR_LENGTH arm to a script that returns null
(not NPE) for null fields, and the SHOULD union answers correctly.
Test plan:
- CalcitePPLConditionBuiltinFunctionIT.testIsEmpty / testIsBlank pass.
- CalciteExplainIT.testExplainIsEmpty / testExplainIsBlank /
testExplainIsEmptyOrOthers pass with regenerated golden plans for
both pushdown-enabled and no-pushdown variants.
- PredicateAnalyzerTest passes, including the unrelated
equals_scriptPushDown_Struct test that still uses
SqlStdOperatorTable.IS_EMPTY for its legitimate map-emptiness
semantics.
Signed-off-by: Songkan Tang <songkant@amazon.com>
Description
PPL's
isempty(x)/isblank(x)were lowered inPPLFuncImpTableusingSqlStdOperatorTable.IS_EMPTYagainst string operands. That operator is the SQL:2003 multiset emptiness predicate — itsOperandTypeCheckerisOperandTypes.COLLECTION_OR_MAPand its enumerable-runtime binding callsjava.util.Collection.isEmpty()reflectively.Passing a string slipped through only because:
RexBuilder.makeCallbypasses the operand checker.target.isEmpty()(a bare method-name call), which Janino then resolves against the actual operand type.String.isEmpty()happens to exist with the same signature, so the script-path runtime works by coincidence.This shape is fragile:
SqlStdOperatorTable.IS_EMPTY, and the plan is rejected with"unresolved function reference IS EMPTY".PredicateAnalyzerhad to special-case-block DSL pushdown for theOR(IS_NULL, IS_EMPTY)shape because OpenSearch's non-short-circuitingbool.shouldwould NPE evaluatingname.isEmpty()on a null operand. So the entire OR was wrapped in a single script_query — a strict pessimization compared to a mixed bool.should.Issues Resolved
This PR replaces the lowering with the predicate that actually matches the documented semantics:
CHAR_LENGTHis a standard string operator with explicit type semantics, backed bySqlFunctions.charLength(String)in Calcite's enumerable runtime, and round-trips through Substrait's default extension catalog. Same observable behavior on existing PPL paths; correct behavior on Substrait-based paths.The
containIsEmptyFunctionguard inPredicateAnalyzer.andOris removed:CHAR_LENGTH(null)follows Calcite'sNullPolicy.STRICTand returns null rather than NPE, so DSL's non-short-circuitingshouldevaluation is safe even when the field is null. This also unlocks better pushdown — theIS_NULLarm is now a nativebool.must_not.existsclause and only theCHAR_LENGTH=0arm remains a script. Null docs no longer touch the script engine.The
isNullOr_ScriptPushDowntest was renamed toisEmpty_pushesIsNullArmAsExistsAndCharLengthArmAsScriptand rewritten to assert the actual structural shape of the pushdown (2-arm bool.should with the expected types per arm) instead of just substring-matching the script lang marker.Plan changes are reflected in the regenerated explain golden files for both
calcite/andcalcite_no_pushdown/variants.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.